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Abstract 

;z: 

i This paper is concerned with the fast computation of Fourier integral operators of 

the general form J^^ e'^'"^'^^^'^'' f{k)dk, where A: is a frequency variable, <i>(a;, k) is a phase 
function obeying a standard homogeneity condition, and / is a given input. This is of 
interest for such fundamental computations are connected with the problem of finding 
I— I numerical solutions to wave equations, and also frequently arise in many applications 

including reflection seismology, curvilinear tomography and others. In two dimensions, 
^ when the input and output are sampled on N x N Cartesian grids, a direct evaluation 

requires 0{N'^) operations, which is often times prohibitively expensive. 
T-H This paper introduces a novel algorithm running in ©(iV^logiV) time, i. e. with 

near-optimal computational complexity, and whose overall structure follows that of the 
butterfly algorithm [30 . Underlying this algorithm is a mathematical insight concern- 
ing the restriction of the kernel e^''**(^''^) to subsets of the time and frequency domains. 
CD Whenever these subsets obey a simple geometric condition, the restricted kernel has 

approximately low-rank; we propose constructing such low-rank approximations using 
a special interpolation scheme, which prefactors the oscillatory component, interpolates 
^ ^ the remaining nonoscillatory part and, lastly, remodulates the outcome. A byproduct 

of this scheme is that the whole algorithm is highly efficient in terms of memory re- 
^ quirement. Numerical results demonstrate the performance and illustrate the empirical 

^ properties of this algorithm. 

Keywords. Fourier integral operators, the butterfly algorithm, dyadic partitioning, 
Lagrange interpolation, separated representation, multiscale computations. 
AMS subject classifications. 44A55, 65R10, 65T50. 



1 Introduction 

This paper introduces an efficient algorithm for evaluating discrete Fourier integral oper- 
ators. Let be a positive integer, which is assumed to be an integer power of 2 with no 
loss of generality, and define the Cartesian grids X = {{ii/N,i2/N),0 < ii,i2 < N} and 
O = {{ki,k2), —N/2 < ki,k2 < N/2}. A discrete Fourier integral operator (FIG) with 
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constant amplitude is defined by 



u{x) = e2-**(^'^)/(A;), xGX, (1.1) 
ken 

where {f{k), /c G $7} is a given input, {u{x),x £ X} is the output and as usual, ^ = \/— T. 
By an obvious analogy with problems in electrostatics, it will be convenient throughout the 
paper to refer to {f{k), /c € 0} as sources and {u{x),x G X} as potentials. Here, the phase 
function k) is assumed to be smooth in (x, k) for k ^ and obeys an homogeneity 
condition of degree 1 in k, namely, ^(x, Xk) = A^>(x, k) for each A > 0. 



A direct numerical evaluation of (1.1) at all the points in X takes 0{N'^) flops, which 
can be very expensive for large values of A^. Surveying the literature, the main obstacle 
to constructing fast algorithms for (1.1) is the oscillatory behavior of the kernel e27rj$(x,fc) 



when N is large, which prevents the use of the standard multiscale techniques developed 
in [3 El |26l [28]. Against this background, the contribution of this paper is to introduce a 
novel algorithm running in 0(A^log A) operations, where the constant is polylogarithmic 
in the prescribed accuracy e. 

1.1 General strategy 

Because the phase function k) is singular at /c = 0, the first step consists in representing 
the frequency variable k in polar coordinates via the transformation 

/2 

k = {ki, k2) = ^Npie'^^'P\ e2"P2 = (cos27r^j2,sin2^p2). (1-2) 

Here and below, the set of all possible points p generated from O is denoted by P, see 
Figure [T]^b) . Note that this transformation guarantees that each point p = (pi,P2) belongs 
to the unit square [0, 1]^ since —N/2 < ki,k2 < N/2. Because of the homogeneity of 
the phase function $ may be expressed in polar coordinates as 

/2 

<^{x,k) = N^<^{x,e'^'''P^)pi ■.= N^{x,p). 

Since $(x, k) is smooth in (x, k) for k ^ 0, ^(x,p) is a smooth function of {x,p) with x and 
p in [0,1]2. 



With these notations, we can reformulate the computational problem (1.1) as 



n(x) = j;e™(-'^)/(p), XGX, 
peP 

in which the sources {f{p)} are now indexed by p instead of k. As we just mentioned, the 
main issue is that the kernel function e27rj7V<i'(x,p) jg j^ighly oscillatory. Our approach relies 
on the observation that this kernel, properly restricted to time and frequency subdomains, 
admits accurate and low-order separated approximations. To see why this is true, consider 
two square boxes A and B in [0, 1]^ centered at xq{A) and po{B), and suppose that the 
sidelengths w{A) and w{B) obey the relationship w{A)w{B) < 1/N. Introduce the new 
function 

i?^^(x,p) := ^{x,p) - ^{xo{A),p) - ^{x,po{B)) + ^ {xo{A) , po{B)) , (1.3) 
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Figure 1: The point distribution and hierarchical partitioning (at a fixed level) for = 64. 
(a) the set X. (b) the set P in polar coordinates, (c) the frequency partitioning in Cartesian 
coordinates {k G $7). 

for each x ^ A and p G B, and decompose the kernel e^'^*^*(^'P) as 



In ( 1.4 ), we note that each of the first three terms depends on at most one variable (x or p). 
Recall now the standard multi-index notation; i and j are multi-indices and for i = {11,12), 
ii,i2 > 0, l^l =11+12 and for x = {xi,X2), = x^iX^2- Applying the mean value theorem 



to R'^^{x,p) successively in p and x gives 

lil=i 



< sup 5] |9^[^(x,/)-M/(xo(A),j,*)]| |(p-j,o(i?))^| 

p*e_B 



< sup sup E \dldi,^{x*,p*)\ \{x-xo{A)y\\{p-p^[B)y\ 

= 0{1/N). (1.5) 
The last equation follows from the smoothness of ^ and from the assumption w{A) w{B) < 



1/N. To summarize, (1.5) gives 2ttN R^^{x,p) = 0(1) and, therefore, the complex expo- 
nential e^'^*^^ (^'P) is nonoscillatory. 

Under some mild smoothness condition, this observation guarantees that for any fixed 
accuracy e, there exists a low-rank separated approximation of e^'^*^-^ i^<p)^ valid over 
A X B, effectively decoupling the spatial variable x from the frequency variable p. We 
propose constructing this low-rank approximation using a tensor-product Chebyshev in- 
terpolation of the function g^'^^^^ i^-.p) in the x variable when w{A) < l/\fN, and in 



the p variable when w{B) < 1/yN. Since the first three terms in (1.4) depend on at 
most one variable, one also has a separated approximation of g'^'^^^'^i^^p) with exactly the 



same separation rank. Looking at (1.4), the resulting low-rank approximation of the kernel 
^2inN'S/{x,p) pg^j^ viewed as a special interpolation scheme that prefactors the oscillatory 
component, interpolates the remaining nonoscillatory part, and finally appends the oscilla- 
tory component. As we will see later, the separation rank providing an e-approximation, 
for any fixed e, is bounded from above by a constant independent of N. Further, if we 
define the potential generated by the sources p inside B for any fixed box B as 

^^(x) = ^e™(-'rt/(p), (1.6) 

peB 
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then the existence of such a separated approximation imphes the existence of a compact 
expansion for the restriction of u^{x) to A, {u^{x), x G A}, of the form 



(1.7) 



In (1.7), the number r of expansion coefficients Sj^^ is independent of A*" for a fixed relative 
error e, as we will see later. 

The problem is then to compute these compact expansions. This is where the basic 
structure of the butterfly algorithm |30| l3T] is powerful. A brief overview is as follows. We 
start by building two quadtrees Tx and Tp (see Figure [T| a) and (b)) respectively in the 
spatial and frequency domains with leaf nodes at level L = log2 N. For each leaf node 
B € Tp, we first construct the expansion coefficients for the potential {u^{x),x G A} 
where A is the root node of Tx ■ This can be done efficiently because i? is a very small box. 
Next, we go down in Tx and up in Tp simultaneously. For each pair (A, B) with A at the 
l-th level of Tx and B at the (L — i)-th level of Tp, we construct expansion coefficients 
for {u^{x),x S A}. As shall see later, the key point is that this is done by using the 
expansion coefficients which have been already computed at the previous level. Finally, we 
arrive at level £ = L, i. e. at the root node of Tp. There u^{x) = u{x), and since one has 
available all the compact expansions corresponding to all the leaf nodes A of Tx , one holds 
an approximation of the potential u{x) for all x ^ X. 

1.2 Applications 



The discrete equation (1.1 ) naturally arises as a numerical approximation of a continuous- 



time FIO of the general form 

u{x)= / a{x,k)e^^''^^''^^'^f{k)dk. (1.8) 



Note that in (1.1), the problem is simplified by setting the amplitude a{x,k) to 1. The 
reason for making this simpler is that in most applications of interest, a{x,k) is a much 
simpler object than the term e27r«*(z,fe)_ instance, a{x, k) often has a low-rank separated 
approximation, which is valid in X M2 and yields a fast algorithm [51 [Tl] . Hence, setting 
a{x, k) = 1 retains the essential computational difficulty. 



A significant instance of (1.8) is the solution operator to the wave equation 

uu{x, t) — c^Ati(x, t) = 

with constant coefficients and x G M^. With initial conditions of the form u{x,0) = uo{x) 
and ut{x, 0) = 0, say, the solution u{x, t) at any time t > is given by 

u{x,t)=^(f e2^*(^-^+"l'=l*)uo(A:)dA; + / e2^*(^-'=-^l'=l*)no(A;)(ifc ) , 

where mq is the Fourier transform of uq. Clearly, this is the sum of two FIOs with phase 
functions ^±{x,k^ = x ■ kziz c\k\t and amplitudes a±(x,k) = 1/2. Further, FIOs are still 
solution operators even in the case of inhomogeneous coefficients c(x) as in 

utt{x, t) — c'^{x)Au{x, t) = 0. 
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Indeed, under very mild smoothness assumptions, the solution operator remains the sum 
of two FIOs, at least for sufficiently small times. The only difference is that the phases and 
amplitudes are a little more complicated. In particular, the phase function is the solution 
of a Hamilton- Jacobi equation which depends upon c{x). 

Another important example of FIO frequently arises in seismics. A fundamental task in 
reflection seismology consists in producing an image of the sharp features of an underground 
medium from the seismograms generated by surface explosions. In a nutshell, one builds an 
imaging operator which maps variations of the pressure field at the surface into variations 
of the sound speed of the medium (large variations indicate the presence of refiectors). This 
imaging operator turns out to be an FIO [51 [H]. Because FIOs are hard to compute, several 
algorithms with various degrees of simplification have been proposed, most notably Kirch- 
hoff migration which approximates the imaging operator as a generalized Radon transform 
in |3Sj . Computing this transform still has a relatively high complexity, namely, of order 
N'^ in 2D. In contrast, the algorithm proposed in this paper has an optimal 0(iV^ log A^) 
operation count, hence possibly offering a significant speedup. 



1.3 Related work 

Although FIOs play an important role in the analysis and computation of linear hyperbolic 
problems, the literature on fast computations of FIOs is surprisingly limited. The only work 



addressing (1.1 ) in this general form is the article [TT] by the authors of the current paper. 
The operative feature in [XT] is an angular partitioning of the frequency domain into \/iV 
wedges, each with an opening angle equal to 2tt/^/N . When restricting the input to such a 
wedge, one can then factor the operator into a product of two simpler operators. The first 
operator is provably approximately low-rank (and lends itself to efficient computations) 
whereas the second one is a nonuniform Fourier transform which can be computed rapidly 
using the nonuniform fast Fourier transform (NFFT) [H [231 ES] • The resulting algorithm 
has an O(A^'^logA) complexity. 

In a different direction, there has been a great amount of research on other types of 
oscillatory integral transforms. An important example is the discrete n-body problem where 
one wants to evaluate sums of the form 

q-jK{\x-x,\), K{r) = e'^^/T 

in the high-frequency regime (u; is large). Such problems appear naturally when solving 
the Helmholtz equation by means of a boundary integral formulation [16i 117) . A popular 
approach seeks to compress the oscillatory integral operator by representing it in an ap- 
propriate basis such as a local Fourier basis, or a basis extracted from the wavelet packet 
dictionary j2l 13 |22l [29]. This representation sparsifies the operator, thus allowing fast 
matrix-vector products. In spite of having good theoretical estimates, this approach has 
thus far been practically limited to ID boundaries. One particular issue with this ap- 
proach is that the evaluation of the remaining nonnegligible coefficients sometimes requires 
assembling the entire matrix, which can be computationally rather expensive. 

To the best of our knowledge, the most successful method for the Helmholtz kernel 
n-body problem in both 2 and 3D is the high-frequency fast multipole method (HF-FMM) 
proposed by Rokhlin and his collaborators in a series of papers |34| [35l [13] . This approach 
combines the analytic property of the Helmholtz kernel with an FFT-type fast algorithm 
to speedup the computation of the interaction between well-separated regions. If A^^ is the 
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number of input and output points as before, the resulting algorithm has an 0(iV^ log A^) 
computational complexity. Other algorithms using similar techniques can be found in 



Finally, the idea of butterfly computations has been applied to the n-body problem 
in several ways. The original paper of Michielssen and Boag [30] used this technique to 
accelerate the computation of the oscillatory interactions between well-separated regions. 
More recently, Engquist and Ying j24| |25] proposed a multidirectional solution to this 
problem, where part of the algorithm can be viewed as a butterfly computation between 
specially selected spatial subdomain. 

1.4 Contents 

The rest of this paper is organized as follows. Section [2] describes the overall structure of the 
butterfly algorithm. In Section|3j we prove the low-rank property of the kernel and introduce 
an interpolation based method for constructing low-rank separated approximations. Section 
|4] develops the algorithm by incorporating our low-rank approximations into the butterfly 
structure. Numerical results are shown in Section |5] Finally, we discuss related problems 
for future research in Section [6l 



We begin by offering a general description of the butterfly structure and then provide 
several concrete examples. This general structure was originally introduced in fSO], and 
later generalized in |31j . 

In this section, X and P are two arbitrary point sets in M'^, both of cardinality M. We 
are given inputs G -P} and wish to compute the potentials {u{x),x G X} defined 



where K{x,p) is some kernel. Let Dx 3 X and Dp D P be two square domains containing 
X and P respectively. The main data structure underlying the butterfly algorithm is a pair 
of dyadic trees Tx and Tp. The tree Tx has Dx as its root box and is built by recursive, 
dyadic partitioning of Dx until each leaf box contains only a small number of points. The 
tree Tp recursively partitions Dp in the same way. With the convention that the root nodes 
are at level 0, one sees that under some uniformity condition about the point distributions, 
the leaf nodes are at level L = O(logM). Throughout, A and B denote the square boxes 
of Tx and Tp, i{A) and i{B) denote their level. 

The crucial property that makes the butterfly algorithm work is a special low-rank 
property. Consider any pair of boxes A £ Tx and B £ Tp obeying the condition i{A) + 
i{B) = L; we want the submatrix {K{x,p),x £ A,p £ B} (we will sometimes loosely 
refer to this as the interaction between A and B) to be approximately of constant rank. 
More rigorously, for any e, there must exist a constant independent of M and two 
sets of functions {af^{x),l < t < r^} and {/5^^(p),l < t < r^} such that the following 
approximation holds 



[lailSllIHlEe]. 



2 The Butterfly Algorithm 



by 





(2.1) 
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The number is called the e-separation rank. The exact form of the functions {af^{x)} 
and of course depends on the problem to which the butterfly algorithm is applied, 

and we will give two examples at the end of this section. 

Recalling the definition u^{x) = X^^g^ K{x,p)f{p), the low-rank property gives a com- 
pact expansion for {u^{x),x E ^4} as summing (2.1) over p & B with weights f{p) gives 



u^{x)-Y,c^i 



AB/ 



t=l 



< 



\f{p)\\e, VxgA 



Therefore, if we can find coefficients {Sf^^}t obeying 



MB 



p&B 



AB 



ip)m, 



then the restricted potential {u^{x),x G A} admits the compact expansion 



(2.2) 



AB 



t=l 



< 



\m\ 



Vx G A. 



We would like to emphasize that for each pair {A, B), the number of terms in the expansion 
is independent of M. 

Computing {5/^-^,1 < t < r^} by means of (2.2) for all pairs A,B is not efficient 
when S is a large box because for each B, there are many paired boxes A. The butterfly 
algorithm, however, comes with an efficient way for computing {6f^} recursively. The 
general structure of the algorithm consists of a top down traversal of Tx and a bottom up 
traversal of Tp, carried out simultaneously. Postponing the issue of computing the separated 
expansions, i.e. {af^{x)} and this is how the butterfly algorithm operates. 

1. Preliminaries. Construct the trees Tx and Tp with root nodes Dx and Dp. 

2. Initialization. Let A be the root of Tx. For each leaf box B of Tp, construct the 
expansion coefficients {Sf^^ , 1 < t < r^} for the potential {u^{x),x G A} by simply 



setting 



cAB 



p€B 



AB 



{P)f{p)- 



(2.3) 



3. Recursion. For £ = 1,2 
pair {A, B) with 1{A) 



, L, visit level £ in Tx and level L — £ in Tp. For each 
and £{B) = L — I, construct the expansion coefficients 
{5i^^ ,1 < t < r^} for the potential {n^(x),x G A}. This is done by using the 
low-rank representation constructed at the previous level (£ = is the initialization 
step). Let Ap be ^'s parent and {Be} be B's children. At level i — 1, the expansion 
coefficients {6^''^''}t' of {u^''{x),x G Ap} are readily available and we have 



.B, 



{x)-Y,ai^''^{x)5^^''' 



t'=i 




\f{p)\ \e, Vx G Ap. 



Since u^{x) = X^c^^^C^)' previous inequality implies that 



.B. 



.B, 



A„B, 



''"^(x)(57 



t' 



c t'=l 



< 



\P^B 



Vx G Ar. 
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Since A C Ap, the above approximation is of course true for any x £ A. However, 
since £{A) + £{B) = L, the sequence of restricted potentials {u^{x),x G A} also has 
a low-rank approximation of size r^, namely, 



■peB 



^{x)-Y^at^{x)6t 
t=i 

Combining these last two approximations, we obtain that {5f^}t should obey 

f; ar{x)5t^ ^Y.f: Vx G A. (2.4) 

t=l c t'=l 

Since this is an overdetermined linear system for {^t^^^t when {S^^^^^t' ,c ^-re available, 
one possible approach to compute is to solve a least-squares problem but 

this can be very costly when \A\ is large. Instead, the butterfly algorithm uses an 
approximate linear transformation mapping {5^''^''}t',c into {^/^'^If, which can be 
computed efficiently. We will discuss how this is done in several examples at the end 
of this section. 

4. Termination. Now I = L and set B to be the root node of Tp. For each leaf box 
A G Tx, use the constructed expansion coefficients {(^/^^}t to evaluate u{x) for each 
X G A, 

u(x)=f]a^^(x)5,^^. (2.5) 
t=i 

A schematic illustration of the algorithm is provided in Figure |2} We would Uke to 
emphasize that the strict balance between the levels of the target boxes A and source boxes 
B maintained throughout the procedure is the key to obtaining accurate low-rank separated 
approximations. 

Leaving aside the computations of the separated expansion and taking for granted that 
constructing {S{^^}t for each pair (A, B) has, in principle, the complexity of applying a lin- 
ear transform of size 0(re x rg), observe that the butterfly algorithm has low computational 
complexity. To be sure, the construction of Tx and Tp clearly takes at most 0(M log M) 
operations. The initialization and termination steps take at most 0{r^ M) as these steps 



require at most 0{r^) operations per point, see (2.3) and ( |2.5| ). The main workload is 
of course in the recursion step. At each fixed level the number of pairs {A, B) under 
consideration is of order 0{M). It follows from our assumption that the number of flops 
required to compute all the coefficients {6f^}t at each level £ is just 0{r1 M). Since there 
are only about logM levels, the number of operations in the recursion is at most of the 
order of 0{r^ MlogM). In conclusion, the overall operation count is 0{r1 MlogM). 

The general structure of the butterfly algorithm should be clear by now but we have 
left out two critical pieces, which we would need to address to apply it to specific problems. 



1. What are the functions {a^^(x)} and {/5f^^(p)} in the low-rank approximation (2.1) 
and how are they computed? 



2. How to solve for {5{''^}t from ([2^? 



The rest of this section discusses answers in two distinct examples. 
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Top down Tx Tp Bottom up 

Figure 2: Schematic illustration of the butterfly algorithm in 2D with 4 levels (L = 3). 
The tree Tx is on the left and Tp is on the right. The levels are paired as indicated so 
that the product of the sidelengths remains constant. The red line pairs two square boxes 
A and B at level 2 (shaded in gray); low-rank approximations of the localized kernel and 
expansion coefficients for the localized potential are computed for each such pair. The 
algorithm starts at the root of Tx and at the bottom of Tp. It then traverses Tx top down 
and Tp bottom up, and terminates when the last level (the bottom of Tx) is reached. The 
figure also represents the four children of any box B. 

Example 1. In [31,, O'Neil and Rokhlin apply the butterfly algorithm to several special 
function transforms in one dimension. Suppose that is a positive integer. In this setup, 
Dx = Dp = [0, A^], X and P are two sets of M = 0{N) points distributed uniformly or 
quasi-uniformly in [0, A^], and the kernel K(x,p) parametrizes some special functions. For 
example, in the case of the Fourier transform, K(x,p) = q^'^^^p/^ so that p parametrizes 
a set of complex sinusoids. The trees Tx and Tp are recursive dyadic partitions of [0, N] 
until the leaf nodes are of unit size. In this work, all the kernels under study have low-rank 
approximations when restricted to any pair A £ Tx and B £ Tp obeying i{A) + i{B) = 
L = log2 N. 

The main tool for constructing the low-rank approximation is the interpolative decom- 
position proposed in |27| [T¥| . Given an m x n matrix Z which is approximately of rank r, 
the interpolative decomposition constructs an approximate factorization Z ~ ZcR, where 
the matrix Zc consists of a subset of r columns taken from the original matrix Z and the 
entries of R have values close to one. Such a decomposition requires 0{mv?') operations 
while storing the matrix R requires 0{rn) memory space. Applying this strategy to the 
kernel -fC(x,p) with x £ A and p £ B implies that the functions {af^{x), 1 < t < r} are of 
the form {K{x,Pi'^), 1 <t < r} with {pf^} C B and the functions {l3t^^{p), 1 < t < r} are 
given by the corresponding entries in the matrix R. Due to the special form of {af^{x)}, 
the coefficients {Sf^}t are often called equivalent sources. 

Now that we have addressed the computations of {af^{x)} and {Pf^^{p)}, it remains to 
examine how to evaluate the coefficients {Sf^}. In the butterfly algorithm, these coefficients 



are computed in the initialization step ( |2.3[ ) and in the recursion step (2.4). Initially, A 
is the root box of Tx and S is a leaf box of Tp. To compute 1 < t < r} in the 

initialization step, construct the interpolative decomposition for K{x,p) with x £ A and 
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p B obeying 



t=i 



<e, yxeA,ypeB. 



(2.6) 



Since each leaf box B contains only a constant number of points p, constructing the in- 
terpolative decomposition requires 0{N) operations and ©(r^) memory space for each B. 
Since there at most 0{N) of these boxes, the computational costs scales at most like 0{N'^). 
Then we simply compute {Sf^, 1 < i < via (2.3). Once the interpolative decomposition 
is available, this requires 0{reN) operation for all pairs at the 0th level. 

As for (2.4), the special form of the functions {af'^{x), 1 < t < r^} allows rewriting the 
right-hand side as 



X,Pt. 



c t'=l 



As a result, we can treat this quantity as the potential generated by the equivalent sources 
{^t'^ ''}c,t' located at {p^,^ ''}c,t'- In order to find {Sf^,l <t< r^}, construct the inter- 
polative decomposition of K{x,p) with x G vl and p G {p^i^ ''}c,t': namely. 



K{x,p)-J2K{x,pf^)l3f''{p) 



t=i 



<s, yx £ A,yp £ {p^"^^} 



c,t'- 



(2.7) 



A B 

Since the numbers of points in {p^,'' ''}c,t' is proportional to r^, this construction requires 
0(r£ 1^1) and requires O(r^) memory space per pair {A,B). Summing (2.7) over p G 
{Pf''^''}c,t' with weights {S^''^''}c,t' gives a way to compute {S{^^}t- Indeed, one can set 



1 < t < r,. 



From the above discussion, we see that the butterfly algorithm described in ^31] requires 
a precomputation step to generate interpolative decompositions for 



K{x,p) for X E A where A is the root node and p G B for each leaf node (2.6), 



A B 

• and K{x,p) for x G A and p G {p^i^ ''}c,t' for each pair {A, B) with t{A) = 1, 2, . . . , L 
and t{A)+i{B) = L ^Tf). 

A simple analysis shows that these "precomputations" take 0{r1 N"^) operations and require 
0(rg A^logA^) memory space. The quadratic time is very costly for problems with large 
N . This might be acceptable if the same Fourier integral operator were applied a large 
number of times. However, in the situation where the operator is applied only a few times, 
the quadratic precomputation step becomes a huge overhead, and the computational time 
may even exceed that of the direct evaluation method. Moreover, the storage requirement 
quickly becomes a bottleneck even for problems of moderate sizes as in practice, the constant 
r1 is often nonnegligible. 



Example 2. In |39], the butterfly algorithm is used to develop a fast algorithm for sparse 
Fourier transforms with both spatial and Fourier data supported on curves. Suppose N \s a, 
positive integer. In this setting, Dx = Dp = [0, N^, and X and P are two set of M = 0{N) 
points supported on smooth curves in [0, A^]^. The kernel is given by K{x,p) = g^Triz-p/Af^ 
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The quadtrees Tx and Tp are generated adaptively in order to prune branches which do 
not intersect with the support curves. The leaf boxes are of unit size and L = log2 N . For 
any pair of boxes A G Tx and B S Tp with ^{A) +i(B) = L, it is shown that the restricted 
kernel K{x,p) is approximately low-rank. Here, the functions {af^{x),l < t < r} take 
the form {K{x,p^),l < t < r} where {p^} is a tensor-product Chebyshev grid located 
inside the box B. The coefficients {Sf^^} — also called equivalent sources — are constructed 
by collocating (2.4) on a tensor-product Chebyshev grid inside the box A. 

Due to the tensor- product structure of the grid {p^ , 1 < i < "r} and the special form 
of the kernel K{x,p) = e^'^*^'^/^, one can compute {Sf^} via a linear transformation 
which is essentially independent of the boxes A and i?. As a result, one does not need the 
quadratic-time precomputation step and there is no need to store explicitly these linear 
transformations. We refer to [39] for more details. 

As we shall see, the algorithm introduced in this paper also makes use of tensor-product 
Chebyshev grids, but the low-rank approximation is constructed through interpolation 
rather than through collocation. Before discussing other similarities and differences, how- 
ever, we first need to introduce our algorithm. 



3 Low-rank Approximations 

Recall that our problem is to compute 

for all X £ X, where X and P are the point sets given in Figure [TJ^a) and (b). As both 
X and P are contained in [0, 1]^, we set Dx = [0, 1]^ and likewise for Dp. Then the two 
quadtrees Tx and Tp recursively partition the domains Dx and Dp uniformly until the 
finest boxes are of sidelength 1/A^. 



3.1 The low-rank property 



We assume that the function ^{x,p) is a real-analytic function in the joint variables x and 
p. This condition implies the existence of two constants Q and R such that 



sup \didi^{x,p)\<QiljlR-\'\- 



■lil 



where i = (11,12) and j = (ji, 72) are multi-indices, i! = ii! and \i\ = ii + 12- For instance, 
the constant R can be set as any number smaller than the uniform convergent radius of the 
power series of Following jTl], we term these functions {Q, R)- analytic. 

The theorem below states that for each pair of boxes (A, B) G Tx x Tp obeying 
w{A)w{B) = 1/A^, the submatrix i^^^'^i-^^^^i^^p) £ A,p £ B} is approximately low- 
rank. Throughout, the notation f ^ g means f < Cg for some numerical constant C 
independent of N and £. 

Theorem 3.1. Let A and B he boxes in Tx and Tp obeying w{A) w{B) = 1/N. For any 
e < Eq and N > Nq, where Sq and Nq are some constants, there exists an approximation 
obeying 



^2mNR^B{x,p) 



E 

t=i 



< e 
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with < log^(l/e). Moreover, 



• when w{B) < 1/^/N, the functions {Pf'^{p)}t can all be chosen as monomials in 
{p — Po{B)) with a degree not exceeding a constant times log^(l/e), 



and when w{A) < 1/\/N , the functions {a^^(x)}i can all he chosen as monomials 
in {x — xq{A)) with a degree not exceeding a constant times log^(l/e). 



The proof of Theorem 3.1 uses the following elementary lemma (see |TT] for a proof). 
Lemma 3.2. For each 2:0 > and e > 0, set Ss = [max(2e2:o,log2(l/e))] . Then 



tl 



< e, V|z| < zq. 



Proof of Theorem 3.1 Below, we will drop the dependence on A and B in xo(^) and po{B) 
for A and B are fixed boxes. Since w{A)w{B) = 1/N, we either have w{A) < l/\/]V or 
w{B) < or both. Suppose for instance that w{B) < I/^/N. Then 

= ^{x,p) - ^'(xcp) - ^{x,pq) + ^'(xo,po) 
= [^{x,p) - ^{xQ,p)] - [^{x,pq) - ^'(xo,po)] 
= Hx{p) - H^{po), 

where Hx{p) := — ^'(xcp); the subscript indicates that we see H as a function 

of p and think of x as a parameter. The function R^^{x,p) inherits the analyticity from 
^{x,p), and its truncated Taylor expansion may be written as 



R [x,p)= 2^ ^ [P-Po] + 2^ -^^^j [P-Po] 

l<\i\<K ' \i\=K 

where p* is a point in the segment [pO)P]- For each i with \i\ = K, we have 



(3.1) 



d;H,{p*) = ^ did;^ix*,p*){x - xoY, 

lil=i 

for some point x* in [xo,x] and, therefore, it follows from the (Q, i?)-analycity property 
that 



d;H,{p* 



ip-Po) 



<2QR~<-^+^'>w{A) {w{B)f < 2QR-'-, i ^) 



N\RJ 



Since w{B) < 1/VN, 1/VN < R/2 w{B)/R < 1/2 and, therefore, for N sufficiently 
large, 

d'pH^ip*) 



-{p-Po) 



N ■ 



Because there are at most K + 1 terms with \i\ = K, it follows that 

< tt{K + l)QR-'^2-^^-^\ 



2nN 



R [x,p)- 2^ -^^^^ [p-Po) 



l<\i\<K 



(3.2) 
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Set 



= Colog(l/e). 



(3.3) 



Then if Co is a sufficiently large numerical constant, the right-hand side of (3.2) is smaller 
than e and, therefore. 



27riV|i?^^(x,j)) - A(x,p)| < e, A{x,p) :-- 



Noting that I e*' 



(3.4) 



i<|i|<i<: 



e I < |a — 6|, we see that in order to obtain an e-accurate separated 
approximation for ^^'^^^R^^i^^v) ^ only need to construct one for g27r«A^A(x,p) _ q^^, p^g^j^ 
is to invoke Lemma 3.2 To do this, we need an estimate on A{x,p). When K = 1, the 
estimate in (3.2) provides a bound of R^^{x,p) 

27r N\R^^{x,p)\ < SttQR-^. 



Combining this estimate with (3.4) yields 



27riV|^(x,p)| < 2ttN\R^'^{x,p)\ + e< SttQR-^ + e. 
By taking e small enough, we can assume 

2e{8TTQR-'^ + e) < log2(l/e), 



and Lemma 3.2 gives a log(l/e)-term e-accurate approximation 

log(l/e)-l 



^2mNA{x,p) 



E 

t=0 



{27rNA{x,p)Y 



< £. 



Expanding (27rz^(x,p))* for each t gives a sum in which each term is a function of x times 
a monomial {p — Po)^ of degree \k\ < log^(l/e). Since there are at most 0(log^(l/e)) 
different choices for the multi-index k in the expanded formula, combining the terms with 
the same multi-index k yields an 0(log'^(l/e))-term 2e-accurate separated approximation 
for q'^'^i-^R (^'P) with factors {I3t^^{p)} of the form {p — po)^ as claimed. 

We studied the case w{B) < l/\fN but the method is identical when w{A) < l/^/~N. 
Write 



i?^^(x,p) = [^{x,p) - ^{x,pq)] - [^{xQ,p) - ^{xq,pq)\ 

and follow the same procedure. The resulting approximation has 0(log^(l/e)) terms, but 
the factors {af'^{x)} are now of the form [x — xq)^ with |A;| < log^(l/e). □ 

shows that the e-rank of ^'^'^^^R'^^i^^v) ig bounded by a constant multiple 



3.1 



Theorem 

of log^(l/e) for a prescribed accuracy e. Since 



^{x,p) = ^{x,pq) + ^{xQ,p) - ^{xo,pq) + i?^^(x,p), 

a direct consequence is that {e27nAf'i'(a:,p)^ ^ g g ^| bag a separated approximation 
of the same rank. A possible approach to compute these approximations would be to 
use the interpolative decomposition described in Example 1 of Section |2j However, this 
method suffers from two main drawbacks discussed in that section limiting its applicability 
to relatively small problems. This is the reason why we propose below a different and faster 
low-rank approximation method. 
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3.2 Interpolation gives good low-rank approximations 



The proof of Theorem 3.1 shows that when w{B) < 1/VN, the p-dependent factors in the 
low-rank approximation of e^'^*^-^ ^^'P^ are all monomials in p. Similarly, when w{A) < 
1/^/N, the x-dependent factors are monomials in x. This suggests that an alternative to 
obtain a low-rank separated approximation is to use polynomial interpolation in x when 
w{A) < 1/VN, and in p when w{B) < 1/Vn. 

For a fixed integer q, the Chebyshev grid of order q on [—1/2, 1/2] is defined by 



1 

z-i = - cos 
2 



ITT 



1 



0<i<g-l 



We use this to define tensor-product grids adapted to an arbitrary squared box with center 
c and sidelength w as 

{c + w{zi^,Zi^),ii,i2 = 0, 1, ... ,0^ - 1}. 

Given a set of grid points {zi G R, < i < q — 1}, we will also consider the family of 
Lagrange interpolation polynomials Li taking value 1 at Zi and at the other grid points 



Li{z; {zi}) 



n 



Z — Zi 



0<j<g-l,jf^i 



For tensor-product grids {zi^i-^} x {^2,22}) we define the 2D interpolation polynomials as 
Li{z,{zi}) = Li^(zi,{zi,iJ)Li2 (22, {22,12}); ^ = ihjk)- 

The theorem below shows that Lagrange interpolation provides efficient low-rank ap- 
proximations. In what follows, is the 2D Lagrange interpolation polynomial on the 
Chebyshev grid adapted to the box B. 



Theorem 3.3. Let A and B be as in Theorem 3.1. Then for any £ < Eq and N > Nq 

where £0 and Nq are the constants in Theorem 3.1 there exists q^ < log^(l/e) such that 



when w{B) < l/\fN , the Lagrange interpolation of e27r«Affi {x,p) p a x q^ 
Chebyshev grid {pf^ adapted to B obeys 



<e, yx e A,yp e B, (3.5) 



and when w{A) < 1/y/N, the Lagrange interpolation of e^'^^^^^^^^^'P^ in x on a q^ xq^ 
Chebyshev grid {xf} adapted to A obeys 



XI e 



2mN R-''^ [x^ ,p) 



<e, VxG^,VpG5. (3.6) 



Both (3.5) and (3.6) provide a low-rank approximation with r^ = q^ < log^(l/e) terms. 



The proof of the theorem depends on the following lemma. 
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Lemma 3.4. Let /(yi, 1/2) G ^"([0, 1]^) and Vq be the space spanned by the monomials y"^y2^ 
with < Qi,a2 < Q'- The projection operator mapping f into its Lagrange interpolant on 
the q X q tensor-product Chebyshev grid obeys 

||/-n,/||<(l + Clog2g) inf 

g<^Vq 



for some numerical constant C , where 



The proof of this lemma is a straightforward generalization of the one dimensional case, 
which can be found in [33j. 



Proof of Theorem 3^. Suppose that w{B) < l/v-/V and pick q^ = Klog{l/e) where K = 
Colog(l/e) is given by (3.3) in the proof of Theorem 3.1 We fix x G A, and view 



^2mNR^^{x,p) ^ function of p G B. Applying Lemma 

^-^^R'^'^M - n^^e2.^iVi?^^{-,-) < (1 + Clog2 q, 

Theorem 



3.4 



inf 



gives 



3.1 



states that the functions {Pf^^ {p)}t are all monomials of degree less than q^. 
Therefore, for a fixed x, the low-rank approximation in that theorem belongs to Vq^, and 
approximates e27rj7VH^^(x,-) -^^ii^j^jj^ £_ Combining this with the previous estimate gives 

,2«Afi?^S(x,) _ ji^^^2ntNR^B{x,-) < ^ (^iQg2 e < {Ci + C2 log2(log(l/e))) £, (3.7) 

where Ci and C2 are two constants independent of N and e. The same analysis applies to 
the situation where w{A) < l/y/N; fix p E B and view g^'^*^^ ^''P^ as a function of x G A, 
repeat the same procedure and obtain the same error bound. 



The estimate (3.7) and its analog when w{A) < l/vN are the claims ( |3.5| ) and (3.6) 
but for the fact that the right-hand side is of the form (Ci -|- C2 log^(log(l/e))) e rather 
than e. In order to get rid of the Ci -|- C2 log^(log(l/e)) factor, we can repeat the proof 
with £^^~^^^ with a small 5 > 0. As only depends on e logarithmically, this only increases 
qe by a small constant factor. □ 

Finally, to obtain a low-rank approximation for the real kernel e^'^*^*(^'P) when w{B) < 
1/VN, multiply ([33| with e2'^*^*(^0'P) e2'^*^*(^''Po) e-2'^»^*(^o,Po) use again xq and po 
as shorthands for xq{A) and po(B)) which gives 



t 



^2-KiN^{xo,p) 



< e, \/xGA,\/pGB. 



In terms of the notations in (2.1), the expansion functions are given by 

a^^[x) = e2-*^*(^'Pf ), = e-2«^*{^o,pf ) ^B(^) ^2.^N^(x,,p) _ (3 8) 

This is a special interpolant of the function e2'^*^*(^''P) in the p- variable which 1) prefactors 
the oscillation 2) performs the interpolation and 3) remodulates the outcome. Following 
(2.2), the expansion coefficients {(5/^'^}t for the potential {u^{x),x S A} should then obey 



the condition 



2-KiN^{xo,p) 



(3.9) 



peB 
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When w{A) < 1/Vn, multiply Q with e^'^^^^^^O'P) e^^'^^^^'Po) g-^^^^^^^O'^o) and obtain 

<£, yxeA,yp£B. 



t 

The expansion functions are now 

The expansion coefficients {5^^'^} should obey 

5t^^ - E Pt''{p)m = E e^'^^^*(^--^)/(p) = u^{xt). (3.11) 

4 Algorithm Description 

This section presents our algorithm which combines the expansions introduced in Section 
[3] with the butterfly structure from Section |2] 

1 . Preliminaries. Construct two quadtrees Tx and Tp for X and P as in Figure [T] Each 
leaf node of Tx and Tp is of size 1/N x 1/N. Since X is a regular Cartesian grid, Tx 
is just a uniform hierarchical partition. 

2. Initialization. Set A to be the root of Tx. For each leaf box B G Tp, construct the 
expansion coefficients {Sf^, 1 < t < r^} from (3.9) by setting 



peB 



3. Recursion. For each £ = l,2,...,L/2, construct the coefficients {5/^^, 1 < i < ''e} for 
each pair {A, B) with A at level ^ and B at the complementary level L — i as follows: 
let Ap be A's parent and {-Bo c = 1,2, 3, 4} be S's children. For each child, we have 
available from the previous level an approximation of the form 

u^^ix) « E e^^^'^^^^'Pf'^^S^"''^, Vx e Ap. 
v 

Summing over all children gives 

^""C^) ^ E E e™(^'P"^)5;^''^% V:e G Ap. 

c t' 

Since A C Ap, this is also true for any x £ A. This means that we can treat {S^i^ 
as equivalent sources in B. As explained below, we then set the coefficients {5^^^t 
as 

^AB ^ g- 2«Ar*(xo (A) ,pf) ^ ^ Lf{pf,'') e2'r»^*(^o(A),p^^'=) (4 2) 

c t' 

4. Switch. The interpolant in p may be used as the low-rank approximation as long as 
I < L/2 whereas the interpolant in x is a valid low-rank approximation as soon as 
i > L/2. Therefore, at I = L/2, we need to switch representation. Recall that for 
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i < L/2 the expansion coefficients {5^^ ,1 < t < r J may be regarded as equivalent 
sources while for I > L/2, they approximate the values of the potential u^{x) on the 
Chebyshev grid {x^, 1 < t < r^}. Hence, for any pair (^4, B) with A at level L/2 (and 
likewise for B), we have 6f^ ~ u^{xf) from (3.11) so that we may set 



(4.3) 



(we abuse notations here since {(5/^^} denotes the new set of coefficients and {6^^} 
the older set). 

5. Recursion (end). The rest of the recursion is analogous. For I = L/2 + 1,...,L, 
construct the coefficients {5^^^, 1 < i < ^e} as follows. With {a^^^} and {(3^^} given 



by (3.10), we have 



u 



B 



(x) = ^n^=(x) « j;a^''^^(x) /3,^^=(p)/(p) ^ ^ ^^^''^^(x),^;^''^^ 



Hence, since (J/^'^ should approximate u^ixf) by (3.11), we simply set 



t',c 

Substituing with its value gives the update 

jAB ^ ^ ^2-KiN^{x^,po{B^)) ^ L^''(x^) e-2«A^*(^;^^Po(Bc)) ^Aj,B, j ^ ^^^^^ 



6. Termination. Finally, we reach i = L and set B to be the root box of Tp. For each 
leaf box A of Tx , we have 



where {a^^} is given by (3.10). Hence, for each x G A, we set 

u(x) = e2-^*(-'Po(^)) (l^(x) e-2-^*(-^Po(B)) ^^ ^^ 



In order to justify (4.2), recall that 



< e, \fxeA,\/peB, 



where Pf^{p) is given by (3.8). Summing the above inequality over p £ {p^''}t',c with 



A B 

weights {6^,'' "} gives 



u 



c,t' 
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which means that we can set 



c,t' 



A.r>Bc 



Substituing fif"^ with its value gives the update (4.2) 



The main workload is in (4.2) and (4.4). Because of the tensor product structures, the 
computations in (4.2) and (4.4) can be accelerated by performing Chebyshev interpolation 
one dimension at a time, reducing the number operations from 0{r^) 



0{qt) to 0{q- 



As there are at most 0{N log N) pairs of boxes {A,B), the recursion steps take at most 
0(re^^ A^^ log A^) operations. It is not difficult to see that the remaining steps of the al- 
gorithm take at most 0(r^ A) operations. Hence, with = 0(log^(l/e)), this gives an 
overall complexity estimate of 0(log^(l/e) A'^log A + log'^(l/e) A^). Since the prescribed 
accuracy e is a constant, our algorithm runs in O(A^logA^) time with a constant poly- 
logarithmic in e. Although the dependence of this constant on log(l/e) is quite strong, 
we would like to emphasize that this is only a worst case estimate. In practice, and as 
empirically demonstrated in Section [5] this dependence is rather moderate and grows like 
log(l/e). 



We would like to point out that the values of L^^p^,") in (4.2) and of L^,^{xf-) in (4.4) 



are both translation and level-independent because of the nested structure of the quadtree. 
Therefore, once these values are computed for a single pair {A, B), they can just be reused 
for all pairs visited during the execution of the algorithm. In our implementation, the 



values of L^{p^') in (lO) and Lf,''{xf) are stored in a Kronecker-product form in order to 



facilitate the dimension-wise Chebyshev interpolation discussed in the previous paragraph. 

This algorithm has two main advantages over the approach based on interpolative de- 
composition. First, no precomputation is required. Since the low-rank approximation uses 
Lagrange interpolation on fixed tensor-product Chebyshev grids, the functions {af'^{x)^ 



and (p)} are given explicitly by (3.8) and (3.10). In turn, this yields explicit formulas 



for computing the expansion coefficients {(5t^^}t, compare (4.1), (4.2), and (4.4). Second, 



this algorithm is highly efficient in terms of memory requirement. In the approach based 
on the interpolative decomposition method, one needs to store many linear transformations 
(one for each pair {A,B)) which yields a storage requirement on the order of A^logA 
as observed earlier. The proposed algorithm, however, only needs to store the expansion 
coefficients {8^^}. Moreover, at any point in the execution, only the expansion coefficients 
from two consecutive levels are actually needed. Therefore, the storage requirement is only 
on the order of A^, which allows us to address problems with much larger sizes. 

One advantage of the interpolative decomposition approach is that it often has a smaller 
separation rank. The reason is that the low-rank approximation is optimized for the kernel 
under study and, therefore, the computed rank is usually very close to the true separation 
rank r^. In contrast, our low-rank approximations are based on tensor-product Chebyshev 
grids and merely exploit the smoothness of the function e^'^*^-^ ^^''^^ either in x or in 
p. In particular, it ignores the finer structure of the kernel e^'^*^*(^''P) and as a result, 
the computed separation rank is often significantly higher. Fortunately, this growth in 
the separation rank does not result in a significant increase in the computation time since 
the tensor-product structure and the Lagrangian interpolants dramatically decrease the 
computational cost. 

The tensor-product Chebyshev grid is also used in the method described in Example 2 
of Section |2] There, the equivalent sources are supported on a Chebyshev grid in B and 
are constructed by collocating the potential on another Chebyshev grid in A. Because of 1) 
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the tensor-product nature of the grids and 2) the nature of the Fourier kernel, the matrix 
representation of this collocation procedure has an almost {A, i?)-independent Kronecker 
product decomposition. This offers a way of speeding up the computations and makes it 
unnecessary to store the matrix representation. Unfortunately, such an approach would 
not work for FIOs since the kernel g'^'^^^'^i^'P) does not have an (A, i3)-independent tensor- 
product decomposition. This is why a major difference is that we use tensor-product 
Chebyshev grids only to interpolate the residual kernel e^'^*^^ ^^'P-* in x or p depending 
on which box is smaller. The important point is that we also keep the main benefits of that 
approach. 

Up to this point, we have only been concerned with the computation of FIOs with 
constant amplitudes. However, our approach can easily be extended to the general case 
with variable amplitudes a{x, k) as in 

u{x) = a{x, A;)e2"**(^''^)/(A;), x e X. (4.6) 
ken 

In most applications of interest, a(x, k) is a simple object, i.e. much simpler than the 
oscillatory term g^'^**'^^'^). A possible approach is to follow [3, 2I4 where the amplitude is 
assumed to have a low-rank separated approximation obeying 



a{x,k) - y^gt{x)ht{k) 



t=i 



where the number of terms is independent of N — the size of the grids X and Q. Such 
an approximation can be obtained either analytically or through the randomized procedure 



described in [11^. An algorithm for computing (4.6) may then operate as follows: 

1. Construct the approximation a(x, k) ~ X^j=i gt{x)ht{k) with x ^ X and /c G 

2. Set u{x) = for X G X and for each t = 1, . . . , Sg, 

(a) form the product ft{k) = ht{k)f{k) for /c G il, 

(b) compute Ylk^'^^^^^^'''^ ft(^) ^'-'^ x G X by applying the above algorithm, 

(c) multiply the result with gt{x) for x G X, and add this product to m(x). 

We would like to point out that the above algorithm is presented in a form that is 
conceptually simple. However, when applying the butterfly algorithm to the functions 
{ft{k),t = 1, . . . , Se} in the multiple executions of Step 2(b), the following kernel evalua- 
tions are independent of {ft{k)} and thus performed redundantly: e"^'^*^*^^''^"^-''^?^ and 



(4.4). Therefore, in an efficient implementation of the above algorithm, one should "vec- 
torize" the butterfly algorithm to operate on {ft{k),t = 1, . . . ,Se} simultaneously so that 
redundant kernel evaluations can be avoided. 



5 Numerical Results 

This section provides some numerical results to illustrate the empirical properties of the 
algorithm. The implementation is in C-|— |- and all tests are carried out on a desktop 
computer with a 2.8GHz CPU. 
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When computing the matrix-vector product u{x) = Y2ken^'^^^^^^'^^ fi^)^ indepen- 
dently sample the entries of the input vector {f(k),k £ 17} from the standard normal 
distribution so that the input vector is just white noise. Let {u'^{x),x E X} be the poten- 
tials computed by the algorithm. To report on the accuracy, we select a set 5 of 256 points 
from X and estimate the relative error by 



According to the algorithm description in Section |4j the leafs of the quadtree at level 
L = log2 N are of size x and each contains a small number of points. However, 
when the number of points in a box B is much less than g^, it does not make sense to 
construct the expansion coefficients simply because the sources at these points would 

themselves provide a more compact representation. Thus in practice, the recursion starts 
from the boxes in Tp that are a couple of levels away from the bottom so that each box has 
at least points in it. Similarly, the recursion stops at the boxes in Tx that are a couple 
of levels away from the bottom. In general, the starting and ending levels should depend 
on the value of Qs. In the following examples, we start from level log2 — 3 and stop at 
level 3 in Tp. This choice matches well with the values of (5 to 11) that we use here. 



In our first example, we consider the computation of ( 1.1 ) with the phase function given 

by 

(^^^ _ ^ . I. , ^ /r2frlA-2 + c^(r)k^ ci(x) = (2 + sin(27rxi) sin(2^X2))/3, 

nx, k)-x k + ^c,[x)k, + c,{x}k„ ^^^^^ ^ ^2 + cos(27rxi) cos(27rx2))/3. ^^'^^ 

If g{x) = Y^ken /(A;)e2^*^-'=/^ is the (periodic) inverse Fourier transform of the input, this 
example models the integration of g over ellipses where ci(x) and C2(x) are the axis lengths 
of the ellipse centered at the point x £ X. In truth, the exact formula of this generalized 
Radon transform contains an amplitude term a{x, k) involving Bessel functions of the first 
and second kinds. Nonetheless, we wish to focus on the main computational difficulty, the 
highly oscillatory phase in this example, and simply set the amplitude a(x, k) to one. Table 
[l] summarizes the results of this example for different combinations of the grid size N (the 
grid is A X N) and of the degree of the polynomial interpolation q. 

Next, we use the algorithm described at the end of Section |4] to study the performance 



in the more general setup of variable amplitudes (4.6). The second example is the exact 



formula for integrating over circles with radii c(x) centered at the points x £ X 

u{x) = ^a+(x,A:)e2"**+(^''=)/(^) + a_(x, fc)e2'^**-(^''=V(^) 

where the amplitudes and phases are given by 

a±{x,k) = (Jo(27rc(x)|A:|) ±iyo(2^c(x)|A:|) e^2-ic(x)|fc| 

^±{x,k) = X ■k + c{x)\k\ (5.3) 

c{x) = (3 -|- sin(27rxi) sin(27rx2))/4. 

(Above the functions Jq and Yq are special Bessel functions. The Appendix in |llj details 
the derivation of these formulas). We use the randomized procedure described in [11] to 
construct the low-rank separated approximation for a±{x,k). For an accuracy of le-7. 
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{N, q) 


ra(sec) 


2d (sec) 


Td/Ta 




(256,5) 


6.11e+l 


3.20e+2 


5.24e+0 


1.26e-2 


(512,5 


2.91e+2 


5.59e+3 


1.92e+l 


1.56e-2 


(1024,5) 


1.48e+3 


9.44e+4 


6.37e+l 


1.26e-2 


(2048,5) 


6.57e+3 


1.53e+6 


2.32e+2 


1.75e-2 


(4096,5) 


3.13e+4 


2.43e+7 


7.74e+2 


1.75e-2 


(256,7) 


1.18e+2 


3.25e+2 


2.76e+0 


7.57e-4 


(512,7) 


5.54e+2 


5.47e+3 


9.87e+0 


6.68e-4 


(1024,7) 


2.76e+3 


9.48e+4 


3.44e+l 


6.45e-4 


(2048,7) 


1.23e+4 


1.46e+6 


1.19e+2 


8.39e-4 


(4096,7) 


5.80e+4 


2.31e+7 


3.99e+2 


8.18e-4 


(256,9) 


2.46e+2 


3.10e+2 


1.26e+0 


3.15e-5 


(512,9) 


1.03e+3 


5.19e+3 


5.06e+0 


3.14e-5 


(1024,9) 


4.95e+3 


9.44e+4 


1.91e+l 


3.45e-5 


(2048,9) 


2.21e+4 


1.48e+6 


6.71e+l 


4.01e-5 


(4096,9) 


1.02e+5 


2.23e+7 


2.18e+2 


4.21e-5 


(256,11) 


4.66e+2 


3.07e+2 


6.59e-l 


7.34e-7 


(512,11) 


1.69e+3 


4.53e+3 


2.68e+0 


7.50e-7 


(1024,11) 


8.33e+3 


9.50e+4 


1.14e+l 


5.23e-7 


(2048,11) 


3.48e+4 


1.49e+6 


4.27e+l 


5.26e-7 



Table 1: Computational results with the phase function given by (5.2). N x N is the size 
of the domain; q is the size of the Chebyshev interpolation grid in each dimension; Ta is the 
running time of the algorithm in seconds; is the estimated running time of the direct 
evaluation method and T^/Ta is the speedup factor; finally, £a is the accuracy estimated 
with 



the resulting approximation contains only 3 terms. Table [2] summarizes the results of this 
example for different combinations of N and q. 

From these tables, the first observation is that the accuracy is well controlled by the 
size of the Chebyshev grid, and that the estimated accuracy £a improves on average by a 
factor of 30 every time q is increased by a factor of 2. In practical applications, one often 
specifies the accuracy instead of the grid size q. To adapt to this situation, the quantity 
(5.1) can be used to estimate the error; whenever the error is too large, one can simply 
increase the value of q until the desired accuracy is reached. The second observation is 
that the accuracy decreases only slightly when increases, indicating that the algorithm 
is numerically stable. This is due to the fact that the Lebesgue constant of the Chebyshev 
interpolation is almost optimal, i. e. the Chebyshev interpolation operator has almost the 
minimum operator norm among all Lagrange interpolants of the same order [33] . 

These results show that the empirical running time of our algorithm closely follows the 
0{N'^ log N) asymptotic complexity. Each time we double A^, the size of the grid quadru- 
ples. The corresponding running time and speedup factor increase by a factor roughly 
equal to 4 as well. We note that for large values of which are of interest to us and 
to practitioners, the numerical results show a very substantial speedup factor over direct 
evaluation. For instance, for 4, 096 x 4, 096 grids, we gain three order of magnitudes since 
one can get nearly two digits of accuracy with a speedup factor exceeding 750. 

The article |llj proposed an 0(A^^'^ log A^) approach based on the partitioning of the 
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{N, q) 


Ta(sec) 


2d (sec) 


Td/Ta 




(256,5) 


1.39e+2 


3.20e+3 


2.31e+l 


1.48e-2 


(512,5) 


7.25e+2 


5.20e+4 


7.17e+l 


1.62e-2 


(1024,5) 


3.45e+3 


8.34e+5 


2.42e+2 


1.90e-2 


(256,7) 


2.69e+2 


3.21e+3 


1.19e+l 


4.71e-4 


(512,7) 


1.38e+3 


5.20e+4 


3.78e+l 


7.30e-4 


(1024,7) 


6.43e+3 


8.35e+5 


1.30e+2 


6.35e-4 


(256,9) 


5.23e+2 


3.20e+3 


6.12e+0 


1.59e-5 


(512,9) 


2.49e+3 


5.17e+4 


2.08e+l 


2.97e-5 


(1024,9) 


1.15e+4 


8.32e+5 


7.25e+l 


1.75e-5 


(256,11) 


1.04e+3 


3.18e+3 


3.06e+0 


8.03e-7 


(512,11) 


4.10e+3 


5.11e+4 


1.24e+l 


9.38e-7 


(1024,11) 


1.84e+4 


8.38e+5 


4.57e+l 


8.01e-7 



Table 2: Computational results with the amplitudes and phase functions given by (5.3). 



frequency domain into conical region. Though the time complexity of this former 
algorithm may not be optimal, we showed that it was efficient in parts because its main 
computational component, the nonuniform fast Fourier transform, is highly optimized. 
Comparing Tables 4 and 5 in [Tl|^with the numerical results presented here, we observe 
that both approaches take roughly the same time for N = 256 and 512. For N < 256, the 
approach based on conical partitioning is faster as its complexity has a smaller constant. 
For A'^ > 512, however, the current approach based on the butterfly algorithm clearly 
outperforms our former approach. 

It is straightforward to generalize our algorithm to higher dimensions. In three dimen- 
sions for example, the main modification is to use a three dimensional Chebyshev grid to 
interpolate e^'^*^ (^-fc). Consider again a simple 3D example modeling the integration over 
spheres with varying radii in which the now 6-dimensional phase function $(x, /c), x, A; G M'^, 
is given by 

<^(x, k) = X ■ k + c{x)\k\, c{x) = (3 + sin(27rxi) sin(27rx2) sin(27rx3))/4. 

Our 3D numerical results are reported in Table |3] In this setup, we see that our approach 
offers a significant speedup even for moderate values of A^. 



{N,q) 


Ta{sec) 


Td{sec) 


Td/Ta 




(64,7) 


1.79e+3 


7.33e+3 


4.10e+0 


3.32e-3 


(128,7) 


1.58e+4 


4.77e+5 


3.02e+l 


4.06e-3 


(256,7) 


1.44e+5 


2.97e+7 


2.06e+2 


3.96e-3 



Table 3: Computational results in 3 dimensions with the phase function given by (5.4). 



^The results in Tables 4 and 5 of [11] were obtained on a desktop with a 2.6GHz CPU, which is slightly 
slower yet comparable to the computer used for the tests in this section. The implementation of the 
nonuniform fast Fourier transform in [11] was written in C++ and complied as a MEX-function. Finally, 
the two examples are not exactly similar but this slight difference is unessential. 
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6 Conclusions and Discussions 



This paper introduced a novel and accurate algorithm for evaluating discrete FIOs. Under- 
lying this approach is a key mathematical property, which says that the kernel, restricted to 
special subdomains in time and frequency, is approximately of very low-rank. Our strategy 
operationalizes this fact by using a multiscale partitioning of the time and frequency domain 
together with the butterfly structure to achieve an 0{N'^ log N) asymptotic complexity. 

A different way to achieve a near-optimal 0{N^ log N) complexity might be to use the 
curvelet transform |12| ITU] of Candes and Donoho, or the wave atoms [20] of Demanet and 
Ying. In [HI [9], Candes and Demanet proved that the curvelet representation of FIOs is 
optimally sparse (the wave atom representation also offers the same optimality), a property 
which relies on the role played by the second dyadic decomposition of Stein and his col- 
laborators [37] . Whether one can operationalize this mathematical insight into an efficient 
algorithm seems an interesting direction for future research. 

The geometric low-rank property together with the butterfly algorithm appear to be a 
very powerful combination to obtain fast algorithms for computing certain types of highly 
oscillatory integrals. We already discussed the work of O'Neil and Rokhlin [31] who have 
used the butterfly algorithm to design fast special transforms, and of Ying who has extended 
this approach to develop fast algorithms for Fourier transforms with sparse data [39^ and 
Fourier transforms with summation constraints [40 . Clearly, it would be of interest to 
identify wide classes of problems for which this general approach may prove powerful. 
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